
Data Storage Systems - System Design Guide

A comprehensive guide to data-related storage and processing systems from a system design interview perspective.


Table of Contents

  1. Core Concepts Overview
  2. Storage Systems
  3. Processing Engines
  4. System Design Considerations
  5. Modern Data Architecture Patterns
  6. Interview Checklist
  7. Common Interview Scenarios

1. Core Concepts Overview

ETL (Extract → Transform → Load)

Definition: Data pipeline pattern where data is extracted from sources, transformed (cleaned, aggregated, enriched), and then loaded into the target system.

Purpose:

  • Complex business logic transformations before storage
  • Data quality enforcement at ingestion
  • Schema validation and normalization
  • Data enrichment and integration from multiple sources

When to Use:

  • Transformations are computationally intensive
  • Need to reduce data volume before loading
  • Target system has limited compute capacity
  • Compliance requires data cleansing before storage
  • Legacy systems with limited query capabilities

Typical Tools: Talend, Informatica, AWS Glue, Apache Airflow + Spark, Azure Data Factory

Example Flow:

MySQL DB → Extract → Spark Transformation → Load → Snowflake
Logs (S3) → Extract → Filter/Aggregate → Load → Redshift

Trade-offs:

  • ✅ Reduced storage costs (only transformed data stored)
  • ✅ Better performance for downstream queries
  • ✅ Data quality guaranteed before loading
  • ❌ Longer pipeline latency
  • ❌ Less flexibility for reprocessing
  • ❌ Schema changes require pipeline updates
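
To make the pattern concrete, here is a minimal PySpark sketch of the MySQL → transform → warehouse flow above. Hostnames, credentials, table names, and the Parquet staging path are placeholders, and the final write targets a staging location rather than a specific warehouse connector.

```python
# Minimal ETL sketch with PySpark: extract from MySQL over JDBC,
# transform (filter + aggregate), then load only the result downstream.
# Hostnames, credentials, and table/column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-orders").getOrCreate()

# Extract: read the source table via JDBC (MySQL JDBC driver must be on the classpath)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Transform: keep completed orders and aggregate daily revenue per region
daily_revenue = (
    orders.filter(F.col("status") == "COMPLETED")
    .groupBy(F.to_date("created_at").alias("order_date"), "region")
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write only the transformed result (Parquet staging shown for brevity;
# a Snowflake/Redshift connector write would replace this in practice)
daily_revenue.write.mode("overwrite").parquet("s3://warehouse-staging/daily_revenue/")
```

Note how the raw `orders` rows never reach the warehouse, which is exactly the storage-cost and flexibility trade-off listed above.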

ELT (Extract → Load → Transform)

Definition: Modern pattern where raw data is loaded first, then transformations run inside the data warehouse/lake.

Purpose:

  • Leverage powerful warehouse compute engines
  • Preserve raw data for flexibility
  • Enable faster ingestion
  • Support schema evolution

When to Use:

  • Using modern cloud warehouses (Snowflake, BigQuery)
  • Need to preserve raw data
  • Transformations might change frequently
  • Multiple teams need different views of same data
  • Real-time or near-real-time requirements

Advantages:

  • Schema-on-read flexibility
  • Raw data always available for reprocessing
  • Faster initial ingestion
  • Easier debugging (raw data visible)
  • Better support for iterative analytics

Example Flow:

API Data → Load to S3 → Load to Snowflake → Transform with dbt
Kafka Stream → Load to BigQuery → Transform with SQL

Trade-offs:

  • ✅ Raw data preservation
  • ✅ Faster time-to-insights
  • ✅ More flexible reprocessing
  • ❌ Higher storage costs
  • ❌ Requires powerful warehouse compute
  • ❌ Data quality issues propagate downstream
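
A minimal ELT sketch, assuming Snowflake and the snowflake-connector-python package: raw JSON is landed as-is first, then the transformation runs as SQL inside the warehouse. The account, stage, table, and field names are illustrative, and the raw table is assumed to have a single VARIANT column.

```python
# ELT sketch: load raw data first, transform in-warehouse with SQL.
# Account, stage, table, and field names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="elt_user", password="***",
    warehouse="ELT_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# Load: copy raw JSON from an external stage as-is into a VARIANT column
# (assumes raw.events_raw has a single VARIANT column named payload)
cur.execute("""
    COPY INTO raw.events_raw (payload)
    FROM @raw.events_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Transform: build a cleaned model on top of the raw table, using warehouse compute
cur.execute("""
    CREATE OR REPLACE TABLE analytics.page_views AS
    SELECT
        payload:user_id::STRING   AS user_id,
        payload:url::STRING       AS url,
        payload:ts::TIMESTAMP_NTZ AS viewed_at
    FROM raw.events_raw
    WHERE payload:event_type::STRING = 'page_view'
""")

cur.close()
conn.close()
```

Because the raw table is preserved, the `CREATE OR REPLACE TABLE` step can be rerun with new logic at any time without re-ingesting anything.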

ETL vs ELT Comparison

| Aspect | ETL | ELT |
|---|---|---|
| Transform Location | Before load (external) | After load (in-warehouse) |
| Compute | Separate processing cluster | Warehouse compute |
| Data Volume | Reduced before storage | Full raw data stored |
| Schema | Fixed upfront | Flexible, schema-on-read |
| Latency | Higher (transform first) | Lower (load raw quickly) |
| Cost | Lower storage, higher compute | Higher storage, leverages warehouse compute |
| Use Case | Legacy systems, complex pre-processing | Modern cloud warehouses |
| Best For | Teradata, Oracle | Snowflake, BigQuery, Databricks |

2. Storage Systems

Data Warehouse

Definition: Structured, schema-on-write system optimized for analytics (OLAP) workloads.

Purpose:

  • Centralized repository for structured business data
  • Optimized for complex analytical queries
  • Historical data analysis
  • Business intelligence and reporting

Architecture Characteristics:

  • Columnar storage: Optimized for aggregate queries
  • Append-only: Immutable historical records
  • Optimized joins: Efficient cross-table analytics
  • Materialized views: Pre-computed aggregations
  • Query optimization: Cost-based optimizers

Data Modeling:

  • Star Schema: Fact table + dimension tables
  • Snowflake Schema: Normalized dimension tables
  • Fact Tables: Metrics, measurements (sales, clicks)
  • Dimension Tables: Descriptive attributes (users, products, time)

Examples:

  • Snowflake (cloud-native)
  • Amazon Redshift (AWS)
  • Google BigQuery (GCP)
  • Azure Synapse Analytics
  • Teradata (on-premise/legacy)

When to Use:

  • Structured business analytics
  • BI dashboards and reports
  • SQL-heavy workloads
  • Historical trend analysis
  • Regulatory reporting

Trade-offs:

  • ✅ Excellent query performance for analytics
  • ✅ SQL interface familiar to analysts
  • ✅ Strong consistency guarantees
  • ✅ Optimized for aggregations and joins
  • ❌ Schema changes are costly
  • ❌ Not ideal for unstructured data
  • ❌ Higher cost per GB than data lakes
  • ❌ Not designed for real-time updates

Data Lake

Definition: Repository that stores raw data in native format - structured, semi-structured, and unstructured.

Purpose:

  • Store massive volumes of raw data cheaply
  • Support diverse data types and formats
  • Enable exploratory analytics and ML
  • Serve as single source of truth

Schema Approach:

  • Schema-on-read: Structure applied at query time
  • Format flexibility: JSON, CSV, Parquet, Avro, logs, images, videos
  • No upfront modeling: Store first, structure later

Examples:

  • AWS S3 + Athena/Glue
  • Azure Data Lake Storage (ADLS)
  • Google Cloud Storage + BigQuery
  • Hadoop HDFS (on-premise)

Common Formats:

  • Parquet: Columnar, compressed, great for analytics
  • Avro: Row-based, schema evolution support
  • ORC: Optimized columnar format for Hive
  • JSON: Flexible but inefficient for large scale
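
To show why Parquet is usually preferred over raw JSON at scale, here is a small pandas/pyarrow sketch that converts JSON-lines events into compressed, date-partitioned Parquet. The file paths and the `ts` column name are assumptions.

```python
# Convert raw JSON-lines events into compressed, date-partitioned Parquet.
# Paths and the "ts" column are placeholders; requires pandas and pyarrow.
import pandas as pd

events = pd.read_json("raw/events-2024-01-15.jsonl", lines=True)
events["event_date"] = pd.to_datetime(events["ts"]).dt.date

# Columnar layout + Snappy compression shrinks the data substantially and lets
# engines like Athena/Spark read only the columns and partitions they need.
events.to_parquet(
    "lake/events/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)
```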

When to Use:

  • Storing raw logs, events, clickstreams
  • IoT sensor data
  • Machine learning training data
  • Unstructured data (images, videos, documents)
  • Data science exploration
  • Cost-effective long-term storage

Challenges:

  • Data governance and cataloging
  • Data quality consistency
  • Query performance can be poor
  • Difficult to maintain ACID properties
  • "Data swamp" risk without proper management

Trade-offs:

  • ✅ Very low storage cost
  • ✅ Handles any data type
  • ✅ Massive scalability
  • ✅ Great for ML and data science
  • ❌ No ACID guarantees (traditionally)
  • ❌ Poor query performance without optimization
  • ❌ Governance challenges
  • ❌ Requires data catalog tools

Lakehouse Architecture

Definition: Hybrid approach combining data lake flexibility with data warehouse performance and reliability.

Purpose:

  • Single platform for all data workloads
  • ACID transactions on data lake storage
  • Schema enforcement with flexibility
  • Support both BI and ML workloads

Key Technologies:

  • Delta Lake (Databricks)
  • Apache Iceberg (Netflix/Apple)
  • Apache Hudi (Uber)

Features:

  • ACID transactions on object storage
  • Time travel and versioning
  • Schema enforcement and evolution
  • Unified batch and streaming
  • Z-ordering and data skipping
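
A minimal sketch of two of these features (transactional writes and time travel) using Delta Lake on PySpark. It assumes a Spark session already configured with the delta-spark package; the table path and schema are placeholders.

```python
# Delta Lake sketch: ACID writes to object storage plus time travel.
# Assumes a Delta-enabled Spark session; the path and schema are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()
path = "s3://lake/silver/users"

# Each write is an atomic commit, producing table versions 0, 1, 2, ...
updates = spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"])
updates.write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```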

Medallion Architecture Layers:

Bronze (raw, as-ingested) → Silver (cleaned, validated) → Gold (aggregated, analytics-ready); covered in detail in the Medallion Architecture section below.

When to Use:

  • Need both data lake economics and warehouse performance
  • Supporting diverse teams (data scientists + analysts)
  • Want to eliminate data silos
  • Need strong consistency with S3-like costs

Trade-offs:

  • ✅ Best of both worlds (lake + warehouse)
  • ✅ Lower cost than pure warehouse
  • ✅ ACID guarantees on cheap storage
  • ✅ Eliminates data duplication
  • ❌ More complex to set up
  • ❌ Newer technology (less mature)
  • ❌ Requires learning new concepts

3. Processing Engines

Apache Spark

Definition: Distributed compute engine for large-scale data processing (batch and streaming).

Purpose:

  • Process massive datasets in parallel
  • Complex transformations and aggregations
  • Machine learning pipelines
  • Unified batch and streaming

Architecture Components:

  • Driver: Coordinates work, builds execution plan
  • Executors: Perform computations on worker nodes
  • Cluster Manager: Resource allocation (YARN, Kubernetes, Mesos)

APIs:

  • Spark SQL (DataFrames) - Structured data processing
  • RDD (Resilient Distributed Datasets) - Low-level API
  • Structured Streaming - Stream processing
  • MLlib - Machine learning library
  • GraphX - Graph processing

Processing Model:

  • Transformations (filter, join, groupBy) are lazy and build up a DAG of operations
  • An action (count, collect, write) triggers execution; the DAG is split into stages and tasks
  • Tasks run in parallel on executors, one per data partition, with shuffles between stages

When to Use:

  • ETL pipelines with complex logic
  • Large-scale data transformations (TB/PB scale)
  • Machine learning at scale
  • Batch processing of historical data
  • Stream processing with Structured Streaming (second-level latency acceptable)

Common Deployment Pattern:

  • An orchestrator (Airflow) submits Spark jobs to a managed cluster (EMR, Dataproc, Databricks, or Kubernetes)
  • Jobs read raw data from object storage (S3/ADLS/GCS), transform it, and write results back to the lake or warehouse

Trade-offs:

  • ✅ Massive scalability (handles PB-scale data)
  • ✅ Rich transformation APIs (SQL, DataFrame, RDD)
  • ✅ Supports multiple languages (Scala, Python, Java, R)
  • ✅ Unified batch and streaming
  • ✅ In-memory processing for speed
  • ❌ High operational complexity
  • ❌ Expensive for small workloads
  • ❌ Requires cluster management expertise
  • ❌ Higher latency than true streaming (Flink)
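
To illustrate the "unified batch and streaming" point, a minimal Structured Streaming sketch: the same DataFrame API applied to a Kafka topic with micro-batch execution. The broker, topic, payload shape, and checkpoint path are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Structured Streaming sketch: the batch DataFrame API applied to a Kafka stream.
# Broker, topic, and checkpoint path are placeholders; requires the
# spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
)

# Count events per 1-minute window, keyed by page
# (the Kafka value is treated as the page URL for brevity)
counts = (
    clicks.select(F.col("timestamp"), F.col("value").cast("string").alias("page"))
    .groupBy(F.window("timestamp", "1 minute"), "page")
    .count()
)

query = (
    counts.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/chk/clickstream")
    .start()
)
query.awaitTermination()
```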

Snowflake

Definition: Cloud-native data warehouse with separated compute and storage architecture.

Architecture Innovation:

  • Storage Layer: Centralized, automatic replication, micro-partitions
  • Compute Layer: Multiple independent virtual warehouses
  • Cloud Services: Metadata, optimization, security, transaction management

Key Features:

  1. Separation of Compute and Storage
  2. Zero-Copy Cloning: Instant data copies without storage duplication
  3. Time Travel: Query historical data (up to 90 days)
  4. Multi-cluster Warehouses: Auto-scaling for high concurrency
  5. Data Sharing: Share live data between accounts without copying
  6. Secure Views: Row/column-level security

When to Use:

  • Business intelligence and analytics
  • Data warehousing with variable workloads
  • Need to scale read and write independently
  • Multiple teams with different compute needs
  • Cross-organization data sharing
  • Ad-hoc analytics with varying concurrency

Design Patterns:

  • Create separate warehouses for ETL, BI, Data Science
  • Use clustering keys for large tables (>1TB)
  • Materialize views for repeated aggregations
  • Partition by time for time-series data
  • Use multi-cluster warehouses for high concurrency
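
A sketch of the "separate warehouses per workload" and clustering-key patterns, issued as SQL through snowflake-connector-python. Warehouse names, sizes, cluster counts, and the table/column are illustrative choices, not prescriptions.

```python
# Sketch of per-workload warehouses and a clustering key, via the Python connector.
# Names, sizes, and cluster counts are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="admin", password="***")
cur = conn.cursor()

# Dedicated ETL warehouse: larger, suspends when idle to save credits
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS ETL_WH
      WAREHOUSE_SIZE = 'XLARGE'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE
""")

# BI warehouse: multi-cluster so concurrent dashboard users queue less
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS BI_WH
      WAREHOUSE_SIZE = 'LARGE'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 5
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE
""")

# Clustering key on a large fact table so date filters prune micro-partitions
cur.execute("ALTER TABLE analytics.fact_sales CLUSTER BY (sale_date)")

cur.close()
conn.close()
```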

Trade-offs:

  • ✅ Scales compute independently from storage
  • ✅ No infrastructure management (fully managed)
  • ✅ Excellent concurrency handling
  • ✅ Pay only for compute used (per-second billing)
  • ✅ Automatic optimization and tuning
  • ❌ Can be expensive at scale (compute costs add up)
  • ❌ Vendor lock-in
  • ❌ Not ideal for real-time updates (optimized for batch/micro-batch)
  • ❌ Limited support for unstructured data

4. System Design Considerations

Data Flow Architecture

Typical Modern Data Stack:

Sources (databases, APIs, events) → Ingestion (Kafka, Fivetran, CDC) → Data Lake (S3/ADLS) → Transformation (dbt, Spark) → Warehouse/Lakehouse (Snowflake, BigQuery, Databricks) → Consumption (BI dashboards, ML, data applications)


Key Design Questions

When designing a data analytics system, always ask these questions:

1. Data Characteristics

  • Volume: How much data per day/hour/minute?
  • Velocity: Real-time, near real-time, micro-batch, or batch?
  • Variety: Structured, semi-structured, or unstructured?
  • Veracity: What are data quality and consistency needs?

2. Access Patterns

  • Read-heavy or write-heavy?
  • Query latency requirements? (milliseconds vs seconds vs minutes)
  • Number of concurrent users?
  • Ad-hoc vs predefined queries?
  • Point queries vs analytical aggregations?

3. Processing Requirements

  • Transformation complexity? (simple filters vs complex joins)
  • Need for joins across multiple sources?
  • Stateful vs stateless processing?
  • Reprocessing/backfill requirements?
  • Data lineage and audit needs?

4. Scalability Needs

  • Expected data growth rate?
  • Geographic distribution requirements?
  • Peak vs average load patterns?
  • Data retention and archival policies?

Storage Selection Matrix

| Scenario | Best Choice | Reason |
|---|---|---|
| Raw logs & ML data | Data Lake (S3/ADLS) | Schema-on-read, cost-effective, supports any format |
| Business reports & BI | Data Warehouse (Snowflake/BigQuery) | Optimized for OLAP, SQL interface |
| Complex ETL pipelines | Spark / AWS Glue | Distributed compute, handles TB/PB scale |
| Unified raw + analytics | Lakehouse (Delta/Iceberg) | Combines lake economics with warehouse performance |
| Real-time analytics | Kafka + Flink + Warehouse | Low-latency streaming pipeline |
| High-concurrency BI | Snowflake / BigQuery | Scales compute separately, multi-cluster support |
| Operational analytics | Druid / ClickHouse | Fast aggregations, sub-second queries |
| Time-series data | InfluxDB / TimescaleDB / Druid | Optimized for time-based queries |
| Search & text analytics | Elasticsearch | Full-text search, log analytics |
| Graph relationships | Neo4j / Amazon Neptune | Graph traversal, relationship queries |

Scaling Strategies

Ingestion Scaling

Strategies:

  • Message queues (Kafka, Kinesis) for buffering
  • Partition by key for parallel processing
  • Rate limiting at source
  • Back-pressure mechanisms
  • Dead letter queues for failed messages
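
To illustrate the buffering and partition-by-key strategies together, a minimal producer sketch using kafka-python. The broker address, topic, and event fields are placeholders.

```python
# Ingestion buffering sketch with kafka-python: events are keyed by user_id so the
# same user always lands on the same partition and can be processed in order.
# Broker, topic, and field names are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",   # wait for replication before acknowledging (durability over latency)
    retries=5,    # retry transient broker errors instead of dropping events
)

event = {"user_id": "u-123", "type": "click", "url": "/checkout"}
producer.send("user-events", key=event["user_id"], value=event)
producer.flush()
```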

Storage Scaling

Strategies:

  • Separate compute from storage (Snowflake model)
  • Horizontal partitioning (sharding by key)
  • Time-based partitioning (year/month/day)
  • Tiered storage (hot/warm/cold based on access patterns)
  • Compression and columnar formats

Query Performance Optimization

Techniques:

  • Materialized views for repeated aggregations
  • Clustering keys / Sort keys for common filters
  • Data skipping with statistics (min/max, bloom filters)
  • Result caching for identical queries
  • Query federation across multiple sources
  • Partition pruning to scan only relevant data

Concurrency Handling

Strategies:

  • Multi-cluster compute (Snowflake multi-cluster warehouses)
  • Read replicas for read-heavy workloads
  • Connection pooling to manage connections efficiently
  • Query queuing and prioritization
  • Workload management (separate ETL from BI queries)

5. Modern Data Architecture Patterns

Lambda Architecture

Concept: Separate batch and speed layers for comprehensive data processing.

Layers:

  • Batch Layer: Historical data processed with Spark/MapReduce on HDFS/S3

    • Complete and accurate
    • High latency (hours/days)
    • Immutable dataset
  • Speed Layer: Real-time data processed with Flink/Storm from Kafka

    • Low latency (seconds/minutes)
    • Approximate results
    • Compensates for batch layer lag
  • Serving Layer: Merged views from both layers (Druid, Cassandra, HBase)

    • Queries combine batch and real-time data
    • Application-facing API

Purpose: Handle both batch and real-time processing with different latency/accuracy trade-offs

Use Cases:

  • Real-time dashboards with historical context
  • Fraud detection with learning from history
  • Recommendation systems

Challenges:

  • Complexity of maintaining two separate code paths
  • Data consistency between layers
  • Operational overhead
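
A toy sketch of the serving-layer merge, assuming the canonical Lambda setup where the speed view only covers events the batch layer has not yet absorbed. The in-memory dicts stand in for Druid/Cassandra/HBase tables.

```python
# Toy Lambda serving-layer merge: the batch view is complete but stale, the speed
# view holds only events since the last batch run; queries combine both.
batch_view = {"2024-01-15": 10_000, "2024-01-16": 12_500}   # recomputed by the batch layer
speed_view = {"2024-01-16": 340, "2024-01-17": 2_100}       # incremental, approximate

def page_views(day: str) -> int:
    """Merge the two layers; the speed layer fills in what batch has not seen yet."""
    return batch_view.get(day, 0) + speed_view.get(day, 0)

print(page_views("2024-01-16"))  # batch total plus the real-time tail
print(page_views("2024-01-17"))  # speed layer only; batch has not caught up yet
```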

Kappa Architecture

Concept: Simplified approach treating everything as a stream.

Simplified Approach:

  • Single stream processing pipeline for all data
  • Replayable message queue (Kafka with long retention)
  • Same code handles both real-time and batch

Components:

  • Stream Storage: Kafka with days/weeks of retention
  • Processing: Flink, Spark Streaming, Kafka Streams
  • Serving: Cassandra, Elasticsearch, Druid

Advantages:

  • Single codebase (no duplicate logic)
  • Easier to maintain and debug
  • Flexibility to reprocess by replaying stream

When to Use:

  • All data can be modeled as events/streams
  • Team has strong stream processing expertise
  • Want to avoid complexity of Lambda

Limitations:

  • Requires replayable message queue
  • Historical reprocessing can be slow
  • May not suit all use cases

Medallion Architecture (Lakehouse)

Concept: Progressive data refinement through layers (Bronze → Silver → Gold).

Layer Characteristics:

| Layer | Data Quality | Schema | Purpose | Example |
|---|---|---|---|---|
| Bronze | Raw, as-is | Flexible | Ingestion, lineage | Raw JSON logs, CDC events |
| Silver | Cleaned, validated | Enforced | Standardized business data | Deduplicated users, validated transactions |
| Gold | Aggregated, enriched | Optimized | Analytics, ML features | Daily sales by region, user behavior features |

Benefits:

  • Clear data lineage and quality progression
  • Different SLAs per layer (Bronze: append-only, Silver: updates allowed, Gold: optimized)
  • Supports both exploratory and production workloads
  • Incremental quality improvement
  • Easy debugging (can trace back through layers)

Implementation:

Bronze: Raw Kafka events → Delta table (append-only)
Silver: Deduplicate, validate, join dimensions → Delta table (upserts)
Gold: Aggregate by time/dimension, create features → Delta table (optimized)
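
A compressed PySpark/Delta sketch of that Bronze → Silver → Gold flow, assuming a Delta-enabled Spark session. Paths and column names are placeholders, and the Silver overwrite stands in for the upsert a production pipeline would use.

```python
# Compressed Bronze → Silver → Gold sketch on Delta tables.
# Assumes a Delta-enabled Spark session; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: append raw event payloads as-is (batch read shown for brevity)
raw = spark.read.json("s3://landing/events/2024-01-15/")
raw.write.format("delta").mode("append").save("s3://lake/bronze/events")

# Silver: deduplicate and validate
# (overwrite shown for brevity; a MERGE/upsert would be used in practice)
bronze = spark.read.format("delta").load("s3://lake/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("user_id").isNotNull())
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/events")

# Gold: aggregate into an analytics-ready table
gold = (
    silver.groupBy(F.to_date("ts").alias("event_date"), "region")
    .agg(F.count("*").alias("events"))
)
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/daily_events_by_region")
```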

6. Interview Checklist

When designing a data analytics platform in an interview:

Step 1: Clarify Requirements (5-10 minutes)

Questions to Ask:

  • What are the primary data sources? (databases, APIs, logs, IoT)
  • What's the data volume? (MB/GB/TB per day)
  • What's the data velocity? (real-time, batch, micro-batch)
  • What are the query latency requirements? (milliseconds, seconds, minutes)
  • How many concurrent users? (10s, 100s, 1000s)
  • What are the retention needs? (days, months, years)
  • Are there compliance/regulatory requirements? (GDPR, HIPAA)
  • What's the budget? (cost constraints)

Step 2: Design Ingestion (5-10 minutes)

Decisions to Make:

  • Batch vs streaming ingestion?
  • Need for message queue? (Kafka/Kinesis)
  • CDC for databases? (Debezium)
  • Schema validation strategy?
  • Error handling and dead letter queues?
  • Data partitioning strategy?

Step 3: Choose Processing Approach (5 minutes)

Decisions to Make:

  • ETL vs ELT approach?
  • Processing engine? (Spark, Flink, dbt)
  • Transformation complexity?
  • Orchestration tool? (Airflow, Dagster, Prefect)
  • Incremental vs full refresh?

Step 4: Select Storage (10 minutes)

Decisions to Make:

  • Data lake for raw data?
  • Warehouse for analytics?
  • Lakehouse for unified approach?
  • Partitioning strategy? (time-based, key-based)
  • Data retention and archival?
  • File format? (Parquet, Avro, ORC)

Step 5: Plan Consumption (5 minutes)

Decisions to Make:

  • BI tool integration? (Tableau, Looker, Power BI)
  • API layer needed? (REST, GraphQL)
  • ML model serving? (Feature store)
  • Data access controls? (RBAC, row-level security)
  • Caching strategy?

Step 6: Address Non-Functional Requirements (5-10 minutes)

Considerations:

  • Scalability strategy? (horizontal, vertical)
  • Monitoring and alerting? (data quality, pipeline health)
  • Data quality checks? (validation, anomaly detection)
  • Disaster recovery plan? (backups, replication)
  • Security and compliance? (encryption, access control)
  • Cost optimization? (tiered storage, auto-scaling)

7. Common Interview Scenarios

Scenario 1: Real-time Analytics Dashboard

Problem Statement: "Design a system to display user activity metrics on a website in near real-time. The dashboard should show active users, page views, clicks, and conversions updated every few seconds."

Requirements Clarification:

  • Traffic: 100K requests/second
  • Latency: Display updates within 5-10 seconds
  • Metrics: Active users (1-min window), page views, clicks, conversions
  • Retention: Hot data (7 days), historical data (1 year)
  • Users: 100 concurrent dashboard viewers

Design:

Web/app events → Kafka → Flink (windowed aggregations) → Redis (hot metrics, last few minutes) + Snowflake (historical) → API (WebSocket/SSE) → Dashboard

Key Decisions:

  1. Ingestion Layer:

    • Kafka for high-throughput event ingestion (handles 100K events/sec easily)
    • Partitioned by user_id for parallel processing
  2. Stream Processing:

    • Flink for stateful stream processing with exactly-once guarantees
    • 5-second tumbling windows for aggregations
    • Maintains state for active user tracking (1-minute window)
  3. Storage Strategy:

    • Redis: Hot data (last 5 minutes) for dashboard queries
    • Snowflake: Historical data for trend analysis
    • Dual write pattern: real-time to Redis, batch to Snowflake
  4. API Layer:

    • WebSocket/Server-Sent Events for pushing updates to dashboard
    • Queries Redis for real-time metrics
    • Queries Snowflake for historical context

Scalability:

  • Kafka partitions: 10 partitions for parallelism
  • Flink parallelism: 5 task managers
  • Redis cluster: 3 nodes with replication
  • Auto-scaling for API layer based on WebSocket connections
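
A simplified sketch of the speed path only: consume events from Kafka and keep short-lived per-minute counters in Redis for the dashboard API to read. In the actual design Flink owns the windowing and exactly-once guarantees; this just shows the data shape. Broker, host, topic, key names, and the ISO timestamp format are assumptions.

```python
# Simplified speed-path sketch: Kafka consumer updating short-lived Redis counters.
# Broker, host, topic, and field names are placeholders.
import json
from kafka import KafkaConsumer
import redis

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
r = redis.Redis(host="redis-host", port=6379)

for msg in consumer:
    event = msg.value
    minute = event["ts"][:16]                            # e.g. "2024-01-15T10:42" (ISO timestamp assumed)
    r.incr(f"page_views:{minute}")                       # rolling per-minute page views
    r.expire(f"page_views:{minute}", 600)                # keep ~10 minutes of hot data
    r.pfadd(f"active_users:{minute}", event["user_id"])  # approximate distinct users (HyperLogLog)
    r.expire(f"active_users:{minute}", 600)
```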

Scenario 2: Data Lake for Machine Learning

Problem Statement: "Design a data lake to store and process diverse data sources for training machine learning models. Data includes user behavior logs, transaction records, product catalog, and customer support tickets."

Requirements:

  • Data sources: Logs (JSON), Databases (MySQL), Files (CSV), Text (tickets)
  • Volume: 10 TB/day
  • ML use case: Customer churn prediction, product recommendations
  • Data scientists: 50 users running ad-hoc queries
  • Training frequency: Daily batch jobs

Design:

Sources (JSON logs, MySQL, CSV, tickets) → S3 Bronze (raw) → AWS Glue/Spark → S3 Silver (Parquet, date-partitioned) → Glue Data Catalog → Athena / SageMaker / Feature Store

Key Decisions:

  1. Storage Format:

    • Bronze (Raw): Original format (JSON, CSV) for lineage
    • Silver (Curated): Parquet format for efficient analytics
    • Partitioning: By date (/year=2024/month=01/day=15/)
  2. Schema Management:

    • Glue Crawler for automatic schema discovery
    • Glue Data Catalog as centralized metadata store
    • Schema evolution support with Parquet
  3. Feature Engineering:

    • Spark jobs in AWS Glue for complex transformations
    • Feature Store for ML features (training + serving)
    • Separate training and inference pipelines
  4. Access Patterns:

    • Data Scientists: Athena for SQL exploration, Jupyter notebooks
    • ML Training: SageMaker reading from S3/Feature Store
    • Orchestration: Airflow for scheduling ETL and training jobs
  5. Data Governance:

    • Data lineage tracking with Apache Atlas
    • S3 lifecycle policies for cost optimization (hot → cold storage)
    • IAM roles for fine-grained access control

Cost Optimization:

  • S3 Intelligent-Tiering for automatic cost optimization
  • Parquet compression (Snappy) reduces storage by 70%
  • Athena charges per query (pay only for scanned data)
  • Spot instances for Spark/SageMaker training jobs
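
A sketch of the data-scientist access path: submitting SQL over the Silver layer through Athena with boto3, where partition filters keep the scanned data (and the per-query cost) small. The region, database, table, and results bucket are placeholders.

```python
# Ad-hoc access sketch: run SQL over the Silver layer through Athena.
# Region, database, table, and output bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT user_id, COUNT(*) AS sessions
        FROM silver.user_behavior
        WHERE year = '2024' AND month = '01' AND day = '15'   -- partition pruning limits the scan
        GROUP BY user_id
    """,
    QueryExecutionContext={"Database": "silver"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)
print("Query submitted:", response["QueryExecutionId"])
```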

Scenario 3: Enterprise Data Warehouse

Problem Statement: "Design a centralized data warehouse for a retail company. The warehouse should consolidate data from multiple OLTP databases, support 500 concurrent BI users, and enable complex analytical queries for sales reporting, inventory management, and customer analytics."

Requirements:

  • Data sources: 5 MySQL databases (orders, inventory, customers, products, stores)
  • Update frequency: Every 15 minutes
  • Query complexity: Complex joins across 10+ tables, window functions
  • Users: 500 concurrent BI users
  • SLA: 95% of queries under 10 seconds
  • Data retention: 5 years

Design:

MySQL databases → Debezium (CDC) → Kafka → Spark micro-batch loads (every 15 minutes) → Snowflake (star schema, dbt models) → BI tools (500 concurrent users)

Key Decisions:

  1. Data Ingestion Strategy:

    • CDC with Debezium: Captures changes from MySQL databases
    • Kafka: Buffers change events, enables replay
    • 15-minute micro-batches: Balances freshness with load
    • Separate ETL warehouse from BI warehouse for workload isolation
  2. Data Modeling:

    • Star schema for query performance (denormalized dimensions)
    • SCD Type 2 for customer dimension (track history)
    • Clustering keys: date for fact_sales, store_id for fact_inventory
    • Materialized views: Pre-compute common aggregations
  3. Snowflake Architecture:

    • ETL Warehouse: X-Large, scheduled for 15-min loads
    • BI Warehouse: Large with multi-cluster (1-10) for 500 users
    • Auto-scaling: Automatically adds clusters during peak usage
    • Auto-suspend: Warehouses suspend after 5 minutes of inactivity
  4. Performance Optimization:

    • Clustering keys on large tables (>100 GB)
    • Materialized views for daily/weekly aggregations
    • Result caching (24 hours)
    • Search optimization service for point lookups
  5. Semantic Layer (dbt):

    • Centralized business logic and metric definitions
    • Data quality tests (not null, unique, referential integrity)
    • Documentation and lineage
    • Incremental models for large fact tables
  6. High Availability:

    • Snowflake multi-AZ deployment
    • Kafka replication factor: 3
    • Spark cluster with multiple workers
    • BI tool load balancing
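
To illustrate the kind of checks the dbt layer encodes (not-null, unique, referential integrity), a small PySpark sketch expressed as direct assertions. Table and column names are placeholders; in the actual design these run as dbt tests after each 15-minute load.

```python
# Data-quality check sketch: not-null, uniqueness, and referential integrity
# on the fact/dimension tables. Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
fact_sales = spark.table("analytics.fact_sales")
dim_customer = spark.table("analytics.dim_customer")

# Not null: every sale must reference a customer
null_keys = fact_sales.filter(F.col("customer_id").isNull()).count()
assert null_keys == 0, f"{null_keys} sales rows have a NULL customer_id"

# Unique: the fact table's surrogate key must not repeat
dupes = fact_sales.groupBy("sale_id").count().filter(F.col("count") > 1).count()
assert dupes == 0, f"{dupes} duplicate sale_id values"

# Referential integrity: every customer_id must exist in the dimension table
orphans = fact_sales.join(dim_customer, "customer_id", "left_anti").count()
assert orphans == 0, f"{orphans} sales reference unknown customers"
```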

Query Performance:

  • 95th percentile query time: 8 seconds
  • Cache hit rate: 60% (repeated queries)
  • Concurrent query support: 500+ users
  • Auto-scaling responds to load in <1 minute